Speaker-dependent bottleneck layer training for speaker adaptation in automatic speech recognition
Authors
Abstract
Speaker adaptation of deep neural networks (DNNs) is difficult and is most commonly performed by modifying the DNN input. Here we propose to learn discriminative feature transformations to obtain speaker-normalised bottleneck (BN) features. This is achieved by interpreting the final two hidden layers as speaker-specific matrix transformations: the weights of these layers are updated with data from a specific speaker to learn speaker-dependent discriminative feature transformations. This simple implementation lends itself to rapid adaptation and is flexible enough to be used in Speaker Adaptive Training (SAT) frameworks. The performance of this approach is evaluated on a meeting recognition task using the official NIST RT’07 and RT’09 evaluation test sets. Supervised adaptation of the BN layer shows performance similar to the application of supervised CMLLR as a global transformation, and the combination of the two appears to be additive. In unsupervised mode, CMLLR adaptation alone yields only 3.4% and 2.5% relative word error rate (WER) improvement on RT’07 and RT’09 respectively, where the baselines include speaker-based cepstral mean and variance normalisation. Combined CMLLR and BN layer speaker adaptation yields relative WER gains of 4.5% and 4.2% respectively. SAT-style BN layer adaptation is attempted and combined with conventional CMLLR SAT, providing relative gains of 1.43% and 2.02% on the RT’07 and RT’09 data sets respectively when compared with CMLLR SAT. While the overall gain from BN layer adaptation is small, the results are statistically significant on both test sets.
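To make the adaptation scheme concrete, below is a minimal PyTorch sketch of per-speaker BN layer fine-tuning. All names (AcousticDNN, adapt_bn_layers, speaker_loader) and all layer sizes are illustrative assumptions rather than the authors' implementation; the point is simply that the shared hidden stack and output layer are frozen while the final two hidden (BN) layers are updated on one speaker's frame-labelled data.

```python
# Illustrative sketch only: freeze everything except the final two hidden
# (bottleneck) layers and fine-tune them on a single speaker's data.
# Model structure and dimensions are assumptions, not from the paper.
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    """Hypothetical hybrid DNN: input -> shared hidden stack -> BN layers -> senone outputs."""
    def __init__(self, feat_dim=440, hidden=2048, bn_dim=39, n_senones=4000):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        # Final two hidden layers, interpreted as the speaker-specific transform.
        self.bn_layers = nn.Sequential(
            nn.Linear(hidden, bn_dim), nn.Sigmoid(),   # speaker-normalised BN features live here
            nn.Linear(bn_dim, hidden), nn.Sigmoid(),
        )
        self.output = nn.Linear(hidden, n_senones)

    def forward(self, x):
        return self.output(self.bn_layers(self.shared(x)))

def adapt_bn_layers(model, speaker_loader, epochs=3, lr=1e-4):
    """Fine-tune only the BN layers on one speaker's (features, senone-label) batches."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.bn_layers.parameters():
        p.requires_grad = True
    optimiser = torch.optim.SGD(model.bn_layers.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in speaker_loader:
            optimiser.zero_grad()
            loss = criterion(model(feats), labels)
            loss.backward()
            optimiser.step()
    return model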
Similar papers
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...
Acoustic Model Adaptation for Automatic Speech Recognition and Animal Vocalization Classification
Jidong Tao, B.Eng., M.S., Marquette University, 2009. Automatic speech recognition (ASR) converts human speech to readable text. Acoustic model adaptation, also called speaker adaptation, is one of the most promising techniques in ASR for improving recognition accuracy. Adaptation works by tuning a g...
Speaker-Based Segmentation and Adaptation in Automatic Speech Recognition
With proper training, automatic speech recognition works quite well when tested in conditions similar to the training conditions, but with a new speaker or a new environment the system performance often degrades. Speaker-based adaptation alters the speech recognition system to better match a specific speaker and thus improves the speech recognition results. In order to use speaker adaptation, t...
An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation
Different training and adaptation techniques for multilingual Automatic Speech Recognition (ASR) are explored in the context of hybrid systems, exploiting Deep Neural Networks (DNN) and Hidden Markov Models (HMM). In multilingual DNN training, the hidden layers (possibly extracting bottleneck features) are usually shared across languages, and the output layer can either model multiple sets of l...